V1: 2022-05-19
V2: 2022-05-26
- Improved the section on keys (for ordering & filtering)
- Added a section for translations of tidyr (and other similar packages)
- Capped tables to display 15 rows max when unfolded
- Improved table display (striping, hiding the contents of nested columns, …)
V3: 2022-07-20
- Updated data.table's examples of dynamic programming using env
- Added new entries in processing examples
- Added new entries to Tidyr & Others: expand + complete, transpose/rotation, …
- Added pivot_wider examples to match the dcast ones in the Pivots section
- Added some new examples here and there across the Basic Operations section
- Added an entry for operating inside nested data.frames/data.tables
- Added a processing example for run-length encoding (i.e. successive event tagging)
V4: 2022-08-05
- Improved the pivot section: example of one-hot encoding (and reverse operation) + better examples of partial pivots with .value
- Added a tidyr::uncount() (row duplication) example
- Improved both light & dark themes (code highlight, tables, …)
V5: 2022-10-15
- Grouped the tabsets by framework
- Revamped the whole Basic Operations section (better structure, reworked examples, …)
- Updated Tidyverse examples to dplyr 1.1.0 & added examples of new functions & arguments (e.g. consecutive_id, …)
- Updated data.table examples to version 1.14.5 & added examples of new functions & arguments (e.g. let, DT(), …)
Setup
library(here) # Project management
library(data.table) # Data wrangling (>= 1.14.5)
library(dplyr) # Data wrangling (>= 1.1.0)
library(tidyr) # Data wrangling (extras) (>= 1.2.1)
library(pipebind) # Piping goodies (>= 0.1.1)
library(stringr) # Manipulating strings
library(purrr) # Manipulating lists
library(lubridate) # Manipulating dates
library(broom)       # Tidying model outputs
data.table::setDTthreads(parallel::detectCores(logical = TRUE))
─ Session info ───────────────────────────────────────────────────────────────
setting value
version R version 4.2.1 (2022-06-23)
os Ubuntu 20.04.5 LTS
system x86_64, linux-gnu
ui X11
language (EN)
collate C.UTF-8
ctype C.UTF-8
tz Europe/Paris
date 2022-11-08
pandoc 2.19.2
Quarto 1.2.247
─ Packages ───────────────────────────────────────────────────────────────────
! package * version date (UTC) lib source
P broom * 1.0.1 2022-08-29 [?] CRAN (R 4.2.1)
P crayon * 1.5.2 2022-09-29 [?] CRAN (R 4.2.1)
P data.table * 1.14.5 2022-10-12 [?] Github (Rdatatable/data.table@052f8da)
P dplyr * 1.0.99.9000 2022-10-12 [?] Github (Tidyverse/dplyr@34981f9)
P ggplot2 * 3.3.6 2022-05-03 [?] CRAN (R 4.2.0)
P gt * 0.7.0 2022-08-25 [?] CRAN (R 4.2.1)
P gtExtras * 0.4.3 2022-10-08 [?] Github (jthomasmock/gtExtras@2b53a12)
P here * 1.0.1 2020-12-13 [?] CRAN (R 4.2.0)
P lubridate * 1.8.0 2021-10-07 [?] CRAN (R 4.2.0)
P pipebind * 0.1.1 2022-08-10 [?] CRAN (R 4.2.0)
P purrr * 0.9000.0.9000 2022-10-13 [?] Github (Tidyverse/purrr@af857c2)
P stringr * 1.4.1 2022-08-20 [?] CRAN (R 4.2.1)
P tidyr * 1.2.1 2022-09-08 [?] CRAN (R 4.2.1)
[1] /home/mar/Dev/Projects/R/ma-riviere.me/renv/library/R-4.2/x86_64-pc-linux-gnu
[2] /home/mar/Dev/Projects/R/ma-riviere.me/renv/sandbox/R-4.2/x86_64-pc-linux-gnu/9a444a72
P ── Loaded and on-disk path mismatch.
──────────────────────────────────────────────────────────────────────────────
1 Basic Operations
data.table general syntax
DT[row selector (filter/sort), col selector (select/mutate/summarize/rename), modifiers (group)]
Data
MT <- as.data.table(mtcars)
IRIS <- as.data.table(iris)[, Species := as.character(Species)]
1.1 Arrange / Order
1.1.1 Basic ordering
1.1.2 Ordering with keys
- Keys physically reorder the dataset within the RAM (by reference)
- No extra memory is used for sorting (other than marking which columns are the key)
- The dataset is marked with a "sorted" attribute
- The dataset is always sorted in ascending order, with NAs first
- Using keyby instead of by when grouping will set the grouping columns as keys
See this SO post for more information on keys.
To see over which keys (if any) the dataset is currently ordered:
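The call that prints the current key was lost during rendering; it is presumably data.table's key(), sketched here:

```r
library(data.table)

MT <- as.data.table(mtcars)
key(MT)               # NULL: no key set yet

setkey(MT, cyl, gear)
key(MT)               # "cyl" "gear"
```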
Unless our task involves repeated subsetting on the same column, the speed gain from key-based subsetting could effectively be nullified by the time needed to reorder the data in RAM, especially for large datasets.
1.1.3 Ordering with (secondary) indices
- setindex creates an index for the provided columns, but doesn't physically reorder the dataset in RAM.
- It stores the ordering vector of the dataset's rows (according to the provided columns) in an additional attribute called index.
data.table [32 x 11]
| [ omitted 17 entries ] |
We can see the additional index attribute added to the data.table:
names(attributes(MT))
[1] "names"             "row.names"         "class"
[4] ".internal.selfref" "index"
We can get the currently used indices with:
indices(MT)
[1] "cyl__gear"
Adding a new index doesn’t remove a previously existing one:
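The indexing calls themselves were dropped during rendering; given the outputs shown, they were presumably along these lines:

```r
library(data.table)
MT <- as.data.table(mtcars)

setindex(MT, cyl, gear)   # computes the ordering vector; the data stays in place
indices(MT)               # "cyl__gear"

setindex(MT, hp)          # a second index is added alongside the first
indices(MT)               # "cyl__gear" "hp"
```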
We can thus use indices to pre-compute the ordering for the columns (or combinations of columns) that we will frequently group or subset by!
1.2 Subset / Filter
1.2.1 Basic filtering
mtcars |> filter(cyl >= 6 & disp < 180)data.frame [5 x 11]
1.2.2 Filter based on a range
1.2.3 Filter with a pattern
mtcars |> filter(str_detect(disp, "^\\d{3}\\."))data.frame [9 x 11]
MT[disp %like% "^\\d{3}\\."]data.table [9 x 11]
1.2.4 Filter on row number (slicing)
mtcars |> slice(1) # slice_head(n = 1)data.frame [1 x 11]
Slice a random sample of rows:
mtcars |> slice_sample(n = 5)data.frame [5 x 11]
MT[1]data.table [1 x 11]
MT[.N]data.table [1 x 11]
Slice a random sample of rows:
MT[sample(.N, 5)]data.table [5 x 11]
1.2.5 Filter distinct/unique rows
mtcars |> distinct(mpg, hp, .keep_all = TRUE)data.frame [31 x 11]
| [ omitted 16 entries ] |
Number of unique rows/values
n_distinct(mtcars$gear)
[1] 3
1.2.6 Filter by keys
When keys or indices are defined, we can filter based on them, which is often a lot faster.
We do not even need to specify the column names we are filtering on: the values are matched to the key columns in order.
setkey(MT, cyl)
MT[.(6)] # Equivalent to MT[cyl == 6]data.table [7 x 11]
setkey(MT, cyl, gear)
MT[.(6, 4)] # Equivalent to MT[cyl == 6 & gear == 4]data.table [4 x 11]
1.2.7 Filter by indices
To filter by indices, we can use the on argument, which creates a temporary secondary index on the fly (if it doesn’t already exist).
IRIS["setosa", on = "Species"]data.table [50 x 5]
| [ omitted 35 entries ] |
Since the time to compute the secondary indices is quite small, we don’t have to use setindex, unless the task involves repeated subsetting on the same columns.
When using on with multiple values, the nomatch = NULL argument avoids creating combinations that do not exist in the original data (i.e. for cyl == 5 here)
MT[.(4:6, 4), on = c("cyl", "gear"), nomatch = NULL]data.table [12 x 11]
1.2.8 Filtering on multiple columns
Filtering with one function taking multiple columns:
f_dat <- \(d) with(d, gear > cyl) # Function taking the data and comparing fixed columns
f_dyn <- \(x, y) x > y # Function taking dynamic columns and comparing them
cols <- c("gear", "cyl")
Manually:
mtcars |> filter(f_dyn(gear, cyl))data.frame [2 x 11]
Dynamically:
Taking column names:
Taking the data:
Manually:
MT[f_dyn(gear, cyl),]data.table [2 x 11]
Dynamically:
Taking column names:
data.table [2 x 11]
Taking the data:
MT[f_dat(MT),] # Can't use .SD in idata.table [2 x 11]
In two steps:
We can’t use .SD in the i clause of a data.table, but we can bypass that constraint by doing the operation in two steps:
- Obtaining a logical vector stating whether each row of the table matches the conditions
- Filtering the original table based on the vector
MT[MT[, f_dat(.SD)]]data.table [2 x 11]
Combining multiple filtering functions:
This function filters rows that have 2 or more non-zero decimals, and we’re going to call it on multiple columns:
decp <- \(x) str_length(str_remove(as.character(abs(x)), ".*\\.")) >= 2
cols <- c("drat", "wt", "qsec")
Manually:
MT[decp(drat) & decp(wt) & decp(qsec), ]data.table [13 x 11]
Dynamically:
In two steps:
1.3 Rename
setnames changes column names in-place
Manually:
mtcars |> rename(CYL = cyl, MPG = mpg)data.frame [32 x 11]
| [ omitted 17 entries ] |
Dynamically:
mtcars |> rename_with(\(c) toupper(c), .cols = matches("^d"))data.frame [32 x 11]
| [ omitted 17 entries ] |
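The data.table side of this tabset is missing from the rendered page; a sketch of the setnames() equivalents (using copy() since setnames() renames by reference):

```r
library(data.table)
MT <- as.data.table(mtcars)

# Manually:
setnames(copy(MT), c("cyl", "mpg"), c("CYL", "MPG"))[]

# Dynamically, on columns matching a pattern:
cols <- grep("^d", names(MT), value = TRUE)
setnames(copy(MT), cols, toupper(cols))[]
```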
1.4 Select
1.4.1 Basic selection
MT[, .(mpg, disp)]data.table [32 x 2]
| [ omitted 17 entries ] |
Alternatives
MT[ , .SD, .SDcols = c("mpg", "disp")]
MT[, .SD, .SDcols = patterns("mpg|disp")]
Remove a column:
MT[, !"cyl"] # MT[, -"cyl"]data.table [32 x 10]
| [ omitted 17 entries ] |
In-place:
copy(MT)[, cyl := NULL][]data.table [32 x 10]
| [ omitted 17 entries ] |
Select & Extract:
mtcars |> pull(disp)
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
Select & Rename:
mtcars |> select(dispp = disp)data.frame [32 x 1]
| [ omitted 17 entries ] |
Select & Extract:
MT[, disp]
 [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 275.8
[13] 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 304.0 350.0
[25] 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0
Select & Rename:
MT[, .(dispp = disp)]data.table [32 x 1]
| [ omitted 17 entries ] |
1.4.2 Dynamic selection
By name:
cols <- c("cyl", "disp")
Removing a column:
mtcars |> select(!{{cols}}) # select(-matches(cols))data.frame [32 x 9]
| [ omitted 17 entries ] |
MT[, ..cols]data.table [32 x 2]
| [ omitted 17 entries ] |
Removing a column:
MT[, !..cols]data.table [32 x 9]
| [ omitted 17 entries ] |
In-place:
copy(MT)[, (cols) := NULL][]data.table [32 x 9]
| [ omitted 17 entries ] |
By pattern:
MT[, .SD, .SDcols = !patterns("^d")]data.table [32 x 9]
| [ omitted 17 entries ] |
MT[, .SD, .SDcols = \(x) all(x != 0)] # Only keep columns where no value == 0data.table [32 x 9]
| [ omitted 17 entries ] |
By column type:
iris |> select(where(\(x) !is.numeric(x)))data.frame [150 x 1]
| [ omitted 135 entries ] |
IRIS[, .SD, .SDcols = !is.numeric]data.table [150 x 1]
| [ omitted 135 entries ] |
1.5 Mutate / Transmute
data.table can mutate in 2 ways:
- Using = creates a new DT with the new columns only (like dplyr::transmute)
- Using := (or let) modifies the current dt in place (like dplyr::mutate)
The function modifying a column should return a result of the same length as the original column (or group).
If only one value is provided with :=, it will be recycled to the whole column/group.
If the number of values provided is smaller than the original column/group:
- With := or let, an error will be raised, asking to manually specify how to recycle the values.
- With =, it will behave like dplyr::summarize (if a grouping has been specified).
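A small sketch of these recycling rules (the column names are illustrative):

```r
library(data.table)
DT <- as.data.table(mtcars)

# A length-1 value is recycled over the whole column:
copy(DT)[, flag := TRUE][]

# With `=` and a grouping, a shorter result acts like a summary:
DT[, .(mean_mpg = mean(mpg)), by = cyl]

# With `:=`, a partial-length result raises an error instead of recycling:
# copy(DT)[, bad := c(1, 2, 3)]  # error: 3 values for 32 rows; use rep() explicitly
```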
1.5.1 Basic transmute
Only keeping the transformed columns.
mtcars |> transmute(cyl = cyl * 2)data.frame [32 x 1]
| [ omitted 17 entries ] |
MT[, .(cyl = cyl * 2)]data.table [32 x 1]
| [ omitted 17 entries ] |
Transmute & Extract:
MT[, (cyl = cyl * 2)]
 [1] 12 12 8 12 16 12 16 8 8 12 12 16 16 16 16 16 16 8 8 8 8 16 16 16 16
[26] 8 8 8 16 12 16 8
1.5.2 Basic mutate
Modifies the transformed column in-place and keeps every other column as-is.
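The code for this tabset was lost in rendering; presumably something like:

```r
library(data.table)
library(dplyr)
MT <- as.data.table(mtcars)

mtcars |> mutate(cyl = cyl * 2)   # dplyr: returns a modified copy

copy(MT)[, cyl := cyl * 2][]      # := modifies in place (hence the copy())
copy(MT)[, let(cyl = cyl * 2)][]  # let() is the alias for := (data.table dev >= 1.14.3)
```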
1.5.3 Dynamic trans/mutate
LHS <- "mean_mpg"
RHS <- "mpg"
data.table [32 x 12]
| [ omitted 17 entries ] |
data.table [32 x 12]
| [ omitted 17 entries ] |
1.5.4 Conditional trans/mutate
Mutate everything based on multiple conditions:
One condition:
data.frame [32 x 12]
| [ omitted 17 entries ] |
Nested conditions:
mtcars |> mutate(Size = case_when(
cyl %between% c(2,4) ~ "small",
cyl %between% c(4,8) ~ "BIG",
.default = "Unk"
))data.frame [32 x 12]
| [ omitted 17 entries ] |
Mutate only rows meeting conditions:
Mutate everything based on multiple conditions:
One condition:
data.table [32 x 12]
| [ omitted 17 entries ] |
Nested conditions:
copy(MT)[, Size := fcase(
cyl %between% c(2,4), "small",
cyl %between% c(4,8), "BIG",
default = "Unk"
)][]data.table [32 x 12]
| [ omitted 17 entries ] |
Mutate only rows meeting conditions:
copy(MT)[am == 1, BIG := cyl >= 6][]data.table [32 x 12]
| [ omitted 17 entries ] |
1.5.5 Complex trans/mutate
1.5.5.1 Column-wise operations
Apply one function to multiple columns:
data.frame [32 x 13]
| [ omitted 17 entries ] |
As a transmute:
data.frame [32 x 2]
| [ omitted 17 entries ] |
Dynamically:
Apply multiple functions to one or multiple columns:
col <- "mpg"
cols <- c("mpg", "disp")
data.frame [32 x 13]
| [ omitted 17 entries ] |
data.frame [32 x 13]
| [ omitted 17 entries ] |
Multiple columns:
data.frame [32 x 15]
| [ omitted 17 entries ] |
1.5.5.2 Row-wise operations
Apply one function to multiple columns (row-wise):
data.frame [32 x 12]
| [ omitted 17 entries ] |
data.frame [32 x 12]
| [ omitted 17 entries ] |
Hybrid base R-Tidyverse:
Apply multiple functions to multiple columns (row-wise)
Apply an anonymous function inside the DT:
1.6 Group / Aggregate
The examples listed apply a grouping but do nothing with it (using .SD to simply keep all columns as-is)
1.6.1 Basic grouping
mtcars |> group_by(cyl, gear)data.frame [32 x 11]
| [ omitted 17 entries ] |
Dynamic grouping:
Use any_of if you expect some columns to be missing in the data.
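The dynamic grouping code itself is missing above; a sketch, with cols as a hypothetical character vector of grouping columns:

```r
library(data.table)
library(dplyr)
MT <- as.data.table(mtcars)
cols <- c("cyl", "gear")

mtcars |> group_by(across(any_of(cols)))  # any_of() tolerates absent columns

MT[, .SD, by = cols]                      # `by` accepts a character vector directly
```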
1.6.2 Current group info
mtcars |>
group_by(cyl) |>
group_walk(\(d, g) with(d, plot(gear, mpg, main = paste("Cyl:", g$cyl))))
1.7 Row numbers & indices
1.7.1 Adding row or group indices
.I: Row indices | .N: Number of rows
.GRP: Group indices | .NGRP: Number of groups
Adding row indices:
mtcars |> mutate(I = row_number())data.frame [32 x 12]
| [ omitted 17 entries ] |
copy(MT)[ , I := .I][]data.table [32 x 12]
| [ omitted 17 entries ] |
Adding group indices:
Adding group indices (same index for each group):
mtcars |> group_by(cyl) |> summarize(GRP = cur_group_id())data.frame [3 x 2]
Mutate instead of summarize:
mtcars |> arrange(cyl) |> group_by(cyl) |> mutate(GRP = cur_group_id())data.frame [32 x 12]
| [ omitted 17 entries ] |
Adding row numbers within each group:
mtcars |> group_by(gear) |> mutate(I_GRP = row_number())data.frame [32 x 12]
| [ omitted 17 entries ] |
Adding group indices (same index for each group):
MT[, .GRP, by = cyl]data.table [3 x 2]
Mutate instead of summarize:
copy(MT)[, GRP := .GRP, keyby = cyl][]data.table [32 x 12]
| [ omitted 17 entries ] |
Adding row numbers within each group:
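The data.table code for within-group row numbers is missing from the rendered page; it was presumably rowid() (or seq_len(.N) by group):

```r
library(data.table)
MT <- as.data.table(mtcars)

copy(MT)[, I_GRP := rowid(gear)][]
# equivalently: copy(MT)[, I_GRP := seq_len(.N), by = gear][]
```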
1.7.2 Filtering based on row numbers (slicing)
Extracting a specific row:
Slicing rows:
tail(mtcars, 10)data.frame [10 x 11]
data.frame [10 x 11]
mtcars |> slice_tail(n = 10)data.frame [10 x 11]
tail(MT, 10)data.table [10 x 11]
MT[(.N-9):.N]data.table [10 x 11]
MT[MT[, .I[(.N-9):.N]]] # Gets the last 10 rows' indices and filters based on themdata.table [10 x 11]
Slicing groups:
Random sample by group:
mtcars |> group_by(cyl) |> slice_sample(n = 5)data.frame [15 x 11]
Filter groups by condition:
data.frame [25 x 11]
| [ omitted 10 entries ] |
mtcars |> group_by(cyl) |> group_modify(\(d,g) if (nrow(d) >= 8) d else data.frame())data.frame [25 x 11]
| [ omitted 10 entries ] |
Random sample by group:
MT[, .SD[sample(.N, 5)], keyby = cyl]data.table [15 x 11]
Filter groups by condition:
MT[, if(.N >= 8) .SD, by = cyl]data.table [25 x 11]
| [ omitted 10 entries ] |
MT[, .SD[.N >= 8], by = cyl]data.table [25 x 11]
| [ omitted 10 entries ] |
1.7.3 Extracting row indices
Getting the row numbers of specific observations:
Row number of the first and last observation of each group:
data.frame [6 x 2]
… while keeping all other columns:
Extracting row indices after filtering:
.I gives the vector of row numbers after any subsetting/filtering has been done
Extracting row numbers in the original dataset:
mtcars |> mutate(I = row_number()) |> filter(gear == 4) |> pull(I)
 [1]  1  2  3  8  9 10 11 18 19 20 26 32
Extracting row numbers in the new dataset (after filtering):
mtcars |> filter(gear == 4) |> mutate(I = row_number()) |> pull(I)
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
Extracting row numbers in the original dataset:
MT[, .I[gear == 4]]
 [1]  1  2  3  8  9 10 11 18 19 20 26 32
Extracting row numbers in the new dataset (after filtering):
MT[gear == 4, .I]
 [1]  1  2  3  4  5  6  7  8  9 10 11 12
1.8 Relocate
1.8.1 Basic reordering
mtcars |> group_by(cyl) |> mutate(GRP = cur_group_id(), .before = 1)data.frame [32 x 12]
| [ omitted 17 entries ] |
setcolorder(copy(MT)[ , GRP := .GRP, by = cyl], c("GRP"))[]data.table [32 x 12]
| [ omitted 17 entries ] |
setcolorder(copy(MT), c(setdiff(colnames(MT), "cyl"), "cyl"))[]data.table [32 x 11]
| [ omitted 17 entries ] |
1.8.2 Reordering by column names
setcolorder(copy(MT), sort(colnames(MT)))[]data.table [32 x 11]
| [ omitted 17 entries ] |
1.9 Summarize:
With data.table, one uses the = operator (inside .()) to summarize. The expression should return values of a length smaller than the original column (or group) size. By default, only the modified columns are kept (like a transmute).
1.9.1 Basic summary
MT[, .(mean_cyl = mean(cyl))]data.table [1 x 1]
1.9.2 Grouped summary
By default, dplyr::summarize will arrange the result by the grouping factor:
By default, data.table keeps the order the groups originally appear in:
MT[, .N, by = cyl]data.table [3 x 2]
To order by the grouping factor, use keyby instead of by:
MT[, .N, keyby = cyl]data.table [3 x 2]
Grouped on a temporary variable:
MT[, .N, by = .(cyl > 6)]data.table [2 x 2]
1.9.3 Column-wise summary
Apply one function to multiple columns:
mtcars |> group_by(cyl) |> summarize(across(everything(), mean))data.frame [3 x 11]
By column type:
By matching column names:
Applying multiple functions to one column:
Apply multiple functions to multiple columns:
Depending on the output we want (i.e. having the function’s output as columns or rows), we can either provide a list of functions to apply (list_of_fns), or a function returning a list (fn_returning_list).
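The definitions of list_of_fns and fn_returning_list were lost during extraction; given the comment "# list(mean, sd)" further down, they were presumably along these lines:

```r
cols <- c("mpg", "disp")

# A named list of functions to apply:
list_of_fns <- list(mean = mean, sd = sd)

# A function returning a named list of summaries:
fn_returning_list <- \(x) list(mean = mean(x), sd = sd(x))
```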
One column per function, one row per variable:
mtcars |>
group_by(cyl) |>
summarize(map_dfr(across(all_of(cols)), fn_returning_list, .id = "Var")) |>
ungroup()data.frame [6 x 4]
One column per variable, one row per function:
mtcars |>
group_by(cyl) |>
summarize(map_dfr(list_of_fns, \(f) map(across(all_of(cols)), f), .id = "Fn")) |>
ungroup()data.frame [6 x 4]
One column per function/variable combination:
One column per function, one row per variable:
data.table [6 x 4]
One column per variable, one row per function:
MT[,
lapply(list_of_fns, \(f) lapply(.SD, f)) |> rbindlist(idcol = "Fn"),
keyby = cyl, .SDcols = cols
]data.table [6 x 4]
One column per function/variable combination:
MT[,
lapply(.SD, fn_returning_list) |>
unlist(recursive = FALSE), # do.call(c, args = _)
keyby = cyl, .SDcols = cols
]data.table [3 x 5]
Different column order & naming scheme:
MT[,
lapply(list_of_fns, \(f) lapply(.SD, f)) |>
unlist(recursive = FALSE), # do.call(c, args = _)
keyby = cyl, .SDcols = cols
]data.table [3 x 5]
Using dcast (see next section):
dcast(MT, cyl ~ ., fun.agg = list_of_fns, value.var = cols) # list(mean, sd)data.table [3 x 5]
2 Pivots
2.1 Melt / Longer
Data:
FAM1data.table [5 x 5]
FAM2data.table [5 x 8]
One group of columns -> single value column
FAM1 |> pivot_longer(cols = matches("dob_"), names_to = "variable")data.frame [15 x 4]
data.table [15 x 4]
FAM1 |> melt(measure.vars = patterns("^dob_"))data.table [15 x 4]
One group of columns -> multiple value columns
FAM1 |> melt(measure.vars = patterns(child1 = "child1$", child2 = "child2$|child3$"))data.table [10 x 5]
2.1.1 Merging multiple yes/no columns:
Melting multiple presence/absence columns into a single variable:
movies_widedata.frame [3 x 4]
pivot_longer(
movies_wide, -ID, names_to = "Genre",
values_transform = \(x) ifelse(x == 0, NA, x), values_drop_na = TRUE
) |> select(-value)data.frame [6 x 2]
2.1.2 Partial pivot:
Multiple groups of columns -> Multiple value columns
Manually:
colA <- str_subset(colnames(FAM2), "^dob")
colB <- str_subset(colnames(FAM2), "^gender")
FAM2 |> melt(measure.vars = list(colA, colB), value.name = c("dob", "gender"), variable.name = "child")data.table [15 x 5]
FAM2 |> melt(measure.vars = list(a, b), value.name = c("dob", "gender"), variable.name = "child") |>
substitute2(env = list(a = I(str_subset(colnames(FAM2), "^dob")), b = I(str_subset(colnames(FAM2), "^gender")))) |> eval()data.table [15 x 5]
Using .value:
Using the .value special identifier allows doing a "half" pivot: the values that would be listed as rows under .value are instead used as column names.
FAM2 |> pivot_longer(cols = matches("^dob|^gender"), names_to = c(".value", "child"), names_sep = "_child")data.frame [15 x 5]
FAM2 |> melt(measure.vars = patterns("^dob", "^gender"), value.name = c("dob", "gender"), variable.name = "child")data.table [15 x 5]
Using measure and value.name:
data.table only
FAM2 |> melt(measure.vars = measure(value.name, child = \(x) as.integer(x), sep = "_child"))data.table [15 x 5]
FAM2 |> melt(measure.vars = measurev(list(value.name = NULL, child = as.integer), pattern = "(.*)_child(\\d{1})"))data.table [15 x 5]
2.2 Dcast / Wider:
General idea:
- Pivot around the combination of id.vars (LHS of the formula)
- The measure.vars (RHS of the formula) are the ones whose values become column names
- The value.var are the ones the values are taken from to fill the new columns
Data:
data.table [15 x 4]
(FAM2L <- FAM2 |> melt(measure.vars = measure(value.name, child = \(.x) as.integer(.x), sep = "_child")))data.table [15 x 5]
Basic pivot wider:
FAM1L |> pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")data.frame [5 x 5]
FAM1L |> dcast(family_id + age_mother ~ variable)data.table [5 x 5]
Using all the columns as IDs:
By default, id_cols = everything()
FAM1L |> pivot_wider(names_from = variable)data.frame [5 x 5]
... => "every unused column"
FAM1L |> dcast(... ~ variable)data.table [5 x 5]
Multiple value columns -> Multiple groups of columns:
FAM2L |> pivot_wider(
id_cols = c("family_id", "age_mother"), values_from = c("dob", "gender"),
names_from = "child", names_sep = "_child"
)data.frame [5 x 8]
data.table [5 x 8]
data.table [5 x 8]
Dynamic names in the formula:
var_name <- "variable"
FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = {{ var_name }})data.frame [5 x 5]
data.table [5 x 5]
FAM1L |> dcast(family_id + age_mother ~ v1) |> substitute2(env = list(v1 = var_name)) |> eval()data.table [5 x 5]
Multiple variables:
id_vars <- c("family_id", "age_mother")
FAM1L |> pivot_wider(id_cols = all_of(id_vars), names_from = variable)data.frame [5 x 5]
FAM1L |> dcast(v1 + v2 ~ variable) |> substitute2(env = list(v1 = id_vars[1], v2 = id_vars[2])) |> eval()data.table [5 x 5]
2.2.1 Renaming (prefix/suffix) the columns:
FAM1L |> pivot_wider(names_from = variable, values_from = value, names_prefix = "Attr: ")data.frame [5 x 5]
FAM1L |> pivot_wider(names_from = variable, values_from = value, names_glue = "Attr: {variable}")data.frame [5 x 5]
2.2.2 Unused combinations:
The logic is inverted between dplyr (keep) and data.table (drop)
FAM1L |> pivot_wider(names_from = variable, values_from = value, id_expand = TRUE, names_expand = FALSE) # (keep_id, keep_names)data.frame [25 x 5]
| [ omitted 10 entries ] |
data.table [25 x 5]
| [ omitted 10 entries ] |
2.2.3 Subsetting:
AFAIK, pivot_wider can't do this on its own.
FAM1L |> filter(value >= lubridate::ymd(20030101)) |>
pivot_wider(id_cols = c("family_id", "age_mother"), names_from = "variable")data.frame [3 x 5]
2.2.4 Aggregating:
Not specifying the column holding the measure vars (the names) results in a single "." column counting, for each ID combination, the values that would otherwise have been spread into separate columns.
FAM1L |> dcast(family_id + age_mother ~ .)data.table [5 x 3]
We can customize that default behavior using the fun.aggregate argument:
Here, we count the number of children for each combination of (family_id + age_mother) -> sum of all non-NA values
FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(.x) sum(!is.na(.x))) |>
rowwise() |> mutate(child_count = sum(c_across(matches("_child")))) |> ungroup()data.frame [5 x 6]
FAM1L |> pivot_wider(id_cols = c(family_id, age_mother), names_from = variable, values_fn = \(.x) sum(!is.na(.x))) |>
mutate(child_count = apply(select(cur_data(), matches("_child")), 1, \(r) sum(r)))data.frame [5 x 6]
(FAM1L |> dcast(family_id + age_mother ~ ., fun.agg = \(.x) sum(!is.na(.x))) |> setnames(".", "child_count"))data.table [5 x 3]
Applying multiple fun.agg:
Data:
(DTL <- data.table(
id1 = sample(5, 20, TRUE),
id2 = sample(2, 20, TRUE),
group = sample(letters[1:2], 20, TRUE),
v1 = runif(20),
v2 = 1L)
)data.table [20 x 5]
| [ omitted 5 entries ] |
Multiple fun.agg applied to one variable:
data.table [8 x 6]
Multiple fun.agg to multiple value.var (all combinations):
data.table [8 x 10]
Multiple fun.agg and multiple value.var (one-to-one):
Here, we apply sum to v1 (for both group a & b), and mean to v2 (for both group a & b)
2.2.5 One-hot encoding:
Making each level of a variable into a presence/absence column:
movies_longdata.frame [6 x 3]
pivot_wider(
movies_long, names_from = "Genre", values_from = "Genre",
values_fn = \(x) !is.na(x), values_fill = FALSE
)data.frame [6 x 5]
3 Joins:
In data.table, a JOIN is just another type of SUBSET: we subset the rows of one data.table with the rows of a second one, based on some conditions that define the type of JOIN.
Matching two tables based on their rows can be done:
- Either on equivalences (equi-joins)
- Or functions comparing one row to another (non-equi joins)
Data:
(DT1 <- data.table(
ID = LETTERS[1:10],
A = sample(1:5, 10, replace = TRUE),
B = sample(10:20, 10)
))data.table [10 x 3]
(DT2 <- data.table(
ID = LETTERS[5:14],
C = sample(1:5, 10, replace = TRUE),
D = sample(10:20, 10)
))data.table [10 x 3]
Basic (right) join example:
right_join(
DT1 |> select(ID, A),
DT2 |> select(ID, C),
by = "ID"
) |> as_tibble()data.frame [10 x 3]
DT1[DT2, .(ID, A, C), on = .(ID)]data.table [10 x 3]
3.1 Outer (right, left):
Appends the data of one table to the other: all rows of one, plus the matching rows/columns of the other.
data.table doesn’t do left joins natively
Subsetting DT1 by DT2:
DT2 (everything) + DT1 (all columns, but only the rows that match those in DT2).
> Looking up DT1’s rows using DT2 (or DT2’s key, if it has one) as an index.
As a right join:
right_join(DT1, DT2, by = "ID") # DT1 into DT2data.table [10 x 5]
DT1[DT2, on = .(ID)]data.table [10 x 5]
As a left join:
Not exactly equivalent to the right join: same columns, but DT2 is first instead of DT1
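Reusing DT1/DT2 from the Data block above, the left-join counterpart (not shown in the rendered page) would be:

```r
library(dplyr)

# Same rows as DT1[DT2, on = .(ID)], but with DT2's columns first:
left_join(DT2, DT1, by = "ID")
```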
Subsetting DT2 by DT1:
DT1 (everything) + DT2 (all columns, but only the rows that match those in DT1).
> Looking up DT2’s rows using DT1 (or DT1’s key, if it has one) as an index.
As a right join:
right_join(DT2, DT1, by = "ID") # DT2 into DT1data.table [10 x 5]
DT2[DT1, on = .(ID)]data.table [10 x 5]
As a left join:
Not exactly equivalent to the right join: same columns, but DT1 is first instead of DT2
3.2 Full (outer):
full_join(DT1, DT2, by = "ID")data.table [14 x 5]
data.table::merge.data.table(DT1, DT2, by = "ID", all = TRUE)data.table [14 x 5]
Alternatively:
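The alternative was lost in rendering; one common data.table idiom (reusing DT1/DT2 from above) joins both tables onto the union of IDs:

```r
library(data.table)

ids <- unique(c(DT1$ID, DT2$ID))          # union of all IDs
DT2[DT1[.(ids), on = .(ID)], on = .(ID)]  # full join: every ID, columns of both
```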
3.3 Inner:
Only returns the ROWS matching both tables:
- Inner: rows matching both DT1 and DT2, columns of both (add DT2’s columns to the right)
- Semi: rows matching both DT1 and DT2, columns of first one
Inner:
inner_join(DT1, DT2, by = "ID") data.table [6 x 5]
DT1[DT2, on = .(ID), nomatch = NULL]data.table [6 x 5]
Semi:
semi_join(DT1, DT2, by = "ID")data.table [6 x 3]
DT1[na.omit(DT1[DT2, on = .(ID), which = TRUE])]data.table [6 x 3]
which = TRUE returns the row numbers instead of the rows themselves.
3.4 Anti:
ROWS of DT1 that are NOT in DT2, and only the columns of DT1.
ROWS of DT2 that are NOT in DT1, and only the columns of DT2.
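The anti-join code is missing from the rendered page; the standard forms (reusing DT1/DT2 from above) are:

```r
library(dplyr)
library(data.table)

anti_join(DT1, DT2, by = "ID")  # rows of DT1 not matched in DT2

DT1[!DT2, on = .(ID)]           # data.table: negated join subset
DT2[!DT1, on = .(ID)]           # rows of DT2 not matched in DT1
```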
3.5 Non-equi joins:
DT1[DT2, on = .(ID, A <= C)]data.table [10 x 4]
3.6 Rolling joins:
DT1[DT2, on = "ID", roll = TRUE]data.table [10 x 5]
Inverse the rolling direction:
DT1[DT2, on = "ID", roll = -Inf]data.table [10 x 5]
DT1[DT2, on = "ID", rollends = TRUE]data.table [10 x 5]
4 Tidyr & Others:
4.1 Remove NA:
na.omit(IRIS, cols = str_subset(colnames(IRIS), "Sepal"))data.table [150 x 5]
| [ omitted 135 entries ] |
4.2 Unite:
Combine multiple columns into a single one:
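The code for this section is missing; a sketch using tidyr::unite() and its paste()-based data.table counterpart:

```r
library(data.table)
library(tidyr)
MT <- as.data.table(mtcars)

mtcars |> unite("gear_carb", gear, carb, sep = "_")

# data.table keeps the input columns; drop them to match unite():
copy(MT)[, gear_carb := paste(gear, carb, sep = "_")][, c("gear", "carb") := NULL][]
```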
4.3 Extract / Separate:
Separate a row into multiple columns based on a pattern (extract) or a separator (separate):
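The examples for this section were lost in rendering; a self-contained sketch on a small hypothetical column:

```r
library(data.table)
library(tidyr)

d <- data.frame(x = c("3_1", "4_2", "5_4"))

d |> separate(x, into = c("gear", "carb"), sep = "_", convert = TRUE)
d |> extract(x, into = c("gear", "carb"), regex = "(\\d+)_(\\d+)", convert = TRUE)

# data.table equivalent with tstrsplit():
as.data.table(d)[, c("gear", "carb") := tstrsplit(x, "_", type.convert = TRUE)][]
```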
MT.ext <- MT[, .(x = str_c(gear, carb, sep = "_"))]
4.4 Separate rows:
Separate a row into multiple rows based on a separator:
Data
(SP <- data.table(
val = c(1,"2,3",4),
date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03"), origin = "1970-01-01")
)
)data.table [3 x 2]
SP |> tidyr::separate_rows(val, sep = ",", convert = TRUE)data.frame [4 x 2]
Solution 1:
data.table [4 x 2]
Solution 2:
data.table [4 x 2]
Solution 3:
(With type conversion)
SP[, unlist(tstrsplit(val, ",", type.convert = TRUE)), by = val][SP, on = "val"][, let(val = V1, V1 = NULL)][]data.table [4 x 2]
Solution 4:
data.table [4 x 2]
(With type conversion)
4.5 Duplicates:
4.5.1 Duplicated rows:
Finding duplicated rows:
data.frame [2 x 11]
MT[, if(.N > 1) .SD, by = .(mpg, hp)]data.table [2 x 11]
Only keeping non-duplicated rows:
This is different from distinct/unique, which will keep one of the duplicated rows of each group.
This removes all groups which have duplicated rows.
Solution 1:
data.frame [30 x 11]
| [ omitted 15 entries ] |
MT[, if(.N == 1) .SD, by = .(mpg, hp)]data.table [30 x 11]
| [ omitted 15 entries ] |
Solution 2:
More convoluted
data.frame [30 x 11]
| [ omitted 15 entries ] |
MT[!MT[, if(.N > 1) .SD, by = .(mpg, hp)], on = colnames(MT)]data.table [30 x 11]
| [ omitted 15 entries ] |
fsetdiff(MT, setcolorder(MT[, if(.N > 1) .SD, by = .(mpg, hp)], colnames(MT)))data.table [30 x 11]
| [ omitted 15 entries ] |
4.5.2 Duplicated values (per row):
(DUPED <- data.table(
A = c("A1", "A2", "B3", "A4"),
B = c("B1", "B2", "B3", "B4"),
C = c("A1", "C2", "D3", "C4"),
D = c("A1", "D2", "D3", "D4")
)
)data.table [4 x 4]
DUPED |> mutate(Repeats = apply(cur_data(), 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", ")))data.table [4 x 5]
DUPED[, Repeats := apply(.SD, 1, \(r) r[which(duplicated(r))] |> unique() |> str_c(collapse = ", "))][]data.table [4 x 5]
With duplication counter:
dup_counts <- function(v) {
rles <- as.data.table(unclass(rle(v[which(duplicated(v))])))[, lengths := lengths + 1]
paste(apply(rles, 1, \(r) paste0(r[2], " (", r[1], ")")), collapse = ", ")
}
DUPED[, Repeats := apply(.SD, 1, \(r) dup_counts(r))][]data.table [4 x 5]
4.6 Expand & Complete:
Here, we are missing an entry for person B in year 2010, which we want to fill:
(CAR <- data.table(
year = c(2010,2011,2012,2013,2014,2015,2011,2012,2013,2014,2015),
person = c("A","A","A","A","A","A", "B","B","B","B","B"),
car = c("BMW", "BMW", "AUDI", "AUDI", "AUDI", "Mercedes", "Citroen","Citroen", "Citroen", "Toyota", "Toyota")
)
)data.table [11 x 3]
4.6.1 Expand:
tidyr::expand(CAR, person, year)data.frame [12 x 2]
CJ(CAR$person, CAR$year, unique = TRUE)data.table [12 x 2]
4.6.2 Complete:
Joins the original dataset with the expanded one:
CAR |> tidyr::complete(person, year)data.frame [12 x 3]
CAR[CJ(person, year, unique = TRUE), on = .(person, year)]data.table [12 x 3]
4.7 Uncount:
Duplicating aggregated rows to get the un-aggregated version back
Data
cols <- c("Mild", "Moderate", "Severe")
dat_aggdata.frame [10 x 6]
dat_agg |>
tidyr::pivot_longer(cols = cols, names_to = "Severity", values_to = "Count") |>
tidyr::uncount(Count) |>
mutate(ID_new = row_number(), .after = "ID") |>
tidyr::pivot_wider(
names_from = "Severity", values_from = "Severity",
values_fn = \(x) ifelse(is.na(x), 0, 1), values_fill = 0
)data.frame [23 x 7]
| [ omitted 8 entries ] |
Solution 1:
(melt(DAT_AGG, measure.vars = cols, variable.name = "Severity", value.name = "Count")
[rep(1:.N, Count)][, ID_new := .I]
|> dcast(... ~ Severity, value.var = "Severity", fun.agg = \(x) ifelse(is.na(x), 0, 1), fill = 0)
|> DT(, -"Count")
)data.table [23 x 7]
| [ omitted 8 entries ] |
Solution 2:
4.8 List / Unlist:
When a column contains a simple vector/list of values (of the same type, without structure)
4.8.1 One listed column:
Single ID (grouping) column:
data.frame [3 x 2]
(MT_LIST <- MT[, .(mpg = .(mpg)), keyby = cyl])
data.table [3 x 2]
Solution 1:
mtcars_list |> unnest(mpg)
data.frame [32 x 2]
MT_LIST[, .(mpg = unlist(mpg)), keyby = cyl]
data.table [32 x 2]
Solution 2:
Bypasses the need for grouping when unlisting, by growing the data.table back to its original number of rows before unlisting.
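The code for this solution is not shown above; a plausible sketch of the described approach (my reconstruction, not the original code):

```r
# Repeat each row as many times as its list element has values,
# then flatten the whole list column at once: no `by` needed,
# since row order matches the unlisted vector's order.
MT_LIST[rep(1:.N, lengths(mpg))][, mpg := unlist(MT_LIST$mpg)][]
```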
data.table [32 x 2]
Multiple ID (grouping) columns:
data.frame [8 x 3]
(MT_LIST2 <- MT[, .(mpg = .(mpg)), keyby = .(cyl, gear)])
data.table [8 x 3]
Solution 1:
mtcars_list2 |> unnest(mpg) # group_by(cyl, gear) is optional
data.frame [32 x 3]
data.table [32 x 3]
Solution 2:
Same as with one grouping column
4.8.2 Multiple listed columns:
Creating the data:
(mtcars_list_mult <- mtcars |> group_by(cyl, gear) |> summarize(across(c(mpg, disp), \(c) list(c))) |> ungroup())
data.frame [8 x 4]
data.table [8 x 4]
Solution 1:
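The unlisting code itself is omitted in the rendering above; a sketch of both translations (my reconstruction — the name MT_LIST_MULT for the data.table counterpart is assumed):

```r
# Tidyverse: unnest both list columns in parallel
mtcars_list_mult |> unnest(c(mpg, disp))

# data.table: unlist every list column within each group
MT_LIST_MULT[, lapply(.SD, unlist), keyby = .(cyl, gear)]
```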
4.9 Nest / Unnest:
When a column contains a data.table/data.frame (with multiple columns, structured)
4.9.1 One nested column:
Nesting
(mtcars_nest <- mtcars |> tidyr::nest(data = -cyl)) # Data is inside a tibble
data.frame [3 x 2]
(MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl])
data.table [3 x 2]
Unnesting
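The unnesting code is missing from the rendering; a sketch of both sides (my reconstruction):

```r
# Tidyverse: expand the nested tibbles back into rows
mtcars_nest |> unnest(data)

# data.table: returning the nested data.table in j expands it,
# one group at a time
MT_NEST[, data[[1]], keyby = cyl]
```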
4.9.2 Multiple nested columns:
Nesting:
(mtcars_nest_mult <- mtcars |> group_by(cyl, gear) |> nest(data1 = c(mpg, hp), data2 = !c(cyl, gear, mpg, hp)) |> ungroup())
data.frame [8 x 4]
(MT_NEST_MULT <- MT[, .(data1 = .(.SD[, .(mpg, hp)]), data2 = .(.SD[, !c("mpg", "hp")])), keyby = .(cyl, gear)])
data.table [8 x 4]
Unnesting:
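The unnesting code is missing here as well; a sketch (my reconstruction):

```r
# Tidyverse: unnest both nested columns together
mtcars_nest_mult |> unnest(c(data1, data2))

# data.table: c() on two data.tables concatenates their column lists,
# so j returns all columns of both nested tables for each group
MT_NEST_MULT[, c(data1[[1]], data2[[1]]), keyby = .(cyl, gear)]
```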
4.9.3 Operate on nested/list columns:
data.frame [3 x 2]
(MT_NEST <- MT[, .(data = .(.SD)), keyby = cyl])
data.table [3 x 2]
Creating a new column using the nested data:
Keeping the nested column:
data.frame [3 x 3]
data.table [3 x 3]
Dropping the nested column:
data.frame [3 x 2]
data.table [3 x 2]
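The code for these two variants is not shown; a sketch using a hypothetical summary statistic (mean mpg) to illustrate the pattern:

```r
# Keeping the nested column:
mtcars_nest |> mutate(mean_mpg = map_dbl(data, \(d) mean(d$mpg)))
copy(MT_NEST)[, mean_mpg := sapply(data, \(d) mean(d$mpg))][]

# Dropping the nested column:
mtcars_nest |> mutate(mean_mpg = map_dbl(data, \(d) mean(d$mpg)), data = NULL)
MT_NEST[, .(cyl, mean_mpg = sapply(data, \(d) mean(d$mpg)))]
```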
Creating multiple new columns using the nested data:
mtcars_nest |> group_by(cyl) |> group_modify(\(d, g) linreg(unnest(d, everything()))) |> ungroup()
data.frame [6 x 6]
data.table [6 x 6]
Operating inside the nested data:
mtcars_nest |>
mutate(data = map(data, \(tibl) mutate(tibl, sum = pmap_dbl(cur_data(), sum)))) |>
unnest(data)
data.frame [32 x 12]
mtcars_nest |>
mutate(across(data, \(tibls) map(tibls, \(tibl) mutate(tibl, sum = apply(cur_data(), 1, sum))))) |>
unnest(data)
data.frame [32 x 12]
Using the nplyr package:
4.10 Rotate / Transpose:
Solution 1:
Using pivots to fully rotate the data.table:
MT_SUMMARY |>
pivot_longer(!cyl, names_to = "Statistic") |>
pivot_wider(id_cols = "Statistic", names_from = "cyl", names_prefix = "Cyl ")
data.frame [6 x 4]
MT_SUMMARY |>
melt(id.vars = "cyl", variable.name = "Statistic") |>
dcast(Statistic ~ paste0("Cyl ", cyl))
data.table [6 x 4]
Solution 2:
library(datawizard)
datawizard::data_rotate(MT_SUMMARY, colnames = TRUE, rownames = "Statistic")
data.frame [6 x 4]
data.table::transpose(MT_SUMMARY, keep.names = "Statistic", make.names = 1)
data.table [6 x 4]
5 Processing examples:
Examples of interesting tasks that I’ve collected over time.
5.1 Find minimum in each group:
MT[, .SD[which.min(mpg)], keyby = cyl]
data.table [3 x 11]
5.2 GROUP > FILTER > MUTATE
Data:
(DAT <- structure(list(
id = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L),
name = c("Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Jane", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob", "Bob"),
year = c(1980L, 1981L, 1982L, 1983L, 1984L, 1985L, 1986L, 1987L, 1985L, 1986L, 1987L, 1988L, 1989L, 1990L, 1991L, 1992L),
job = c("Manager", "Manager", "Manager", "Manager", "Manager", "Manager", "Boss", "Boss", "Manager", "Manager", "Manager", "Boss", "Boss", "Boss", "Boss", "Boss"),
job2 = c(1L, 1L, 1L, 1L, 1L, 1L, 0L, 0L, 1L, 1L, 1L, 0L, 0L, 0L, 0L, 0L)
),
.Names = c("id", "name", "year", "job", "job2"),
class = "data.frame",
row.names = c(NA, -16L)
) |> setDT())
data.table [16 x 5]
Tidyverse:
DAT |> group_by(name, job) |>
filter(job != "Boss" | year == min(year)) |>
mutate(cumu_job2 = cumsum(job2)) |>
ungroup()
data.frame [11 x 6]
Here, the grouping is done BEFORE the filter, so some groups can end up empty, and those will sum to 0.
data.table:
Solution 1:
data.table [11 x 3]
Solution 2:
data.table [11 x 3]
Solution 3:
DAT[DAT[, .I[job != "Boss" | year == min(year)], by = .(name, job)]$V1
][, cumu_job2 := cumsum(job2), by = .(name, job)][]
data.table [11 x 6]
If we filtered after the grouping:
5.3 GROUP > SUMMARIZE > JOIN > MUTATE
Data:
(GSJM1 <- data.table(x = c(1,1,1,1,2,2,2,2), y = c("a", "a", "b", "b"), z = 1:8, key = c("x", "y")))
data.table [8 x 3]
(GSJM2 <- data.table(x = 1:2, y = c("a", "b"), mul = 4:3, key = c("x", "y")))
data.table [2 x 3]
Tidyverse:
as.data.frame(GSJM1) |>
group_by(x, y) |>
summarise(z = sum(z)) |>
ungroup() |>
right_join(GSJM2) |>
mutate(z = z * mul) |>
select(-mul)
data.frame [2 x 3]
data.table:
Basic:
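The basic data.table code is missing from the rendering; a sketch of the chained equivalent (my reconstruction):

```r
# Aggregate, join on the shared keys, then compute the product and drop `mul`
GSJM1[, .(z = sum(z)), keyby = .(x, y)
  ][GSJM2, on = .(x, y)
  ][, .(x, y, z = z * mul)]
```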
Advanced (using .EACHI):
GSJM1[GSJM2, .(z = sum(z) * mul), by = .EACHI]
data.table [2 x 3]
5.4 Separating rows & cleaning text:
Data
(DT_COMA <- data.table(
first = c(1,"2,3",3,4,5,6.5,7,8,9,0),
second = c(1,"2,,5",3,4,5,"6,5,9",7,8,9,0),
third = c("one", "two", "thr,ee", "four", "five", "six", "sev,en", "eight", "nine", "zero"),
fourth = as.Date(c("1/1/2020", "2/1/2020", "3/1/2020", "4/1/2020", "5/1/2020", "6/1/2020", "7/1/2020", "8/1/2020", "9/1/2020", "10/1/2020"), format = "%d/%m/%Y") # quoted: unquoted 1/1/2020 would be evaluated as a division
)
)
data.table [10 x 4]
5.4.1 Step1: Cleaning
Removing unwanted commas within words
Tidyverse:
DT_COMA |> mutate(across(where(\(v) is.character(v) & all(is.na(as.numeric(v)))), \(v) str_remove_all(v, ",")))
data.table [10 x 4]
data.table:
cols_to_clean <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & all(is.na(as.numeric(v)))] |> colnames()
copy(DT_COMA)[, c(cols_to_clean) := map(.SD[, cols_to_clean, with = F], \(v) str_remove_all(v, ","))][]
data.table [10 x 4]
5.4.2 Step 2: Separating rows
Each numeric row that has multiple comma-separated values has to be split into multiple rows (one value per row)
Tidyverse:
cols_to_separate <- DT_COMA |> select(where(\(v) is.character(v) & any(!is.na(as.numeric(v))))) |> colnames()
reduce(
cols_to_separate,
\(acc, col) acc |> tidyr::separate_rows(col, sep = ",", convert = T),
.init = DT_COMA
)
data.frame [17 x 4]
data.table:
cols_to_separate <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & any(!is.na(as.numeric(v)))] |> colnames()
(reduce(
cols_to_separate,
\(acc, col) acc[rep(1:.N, lengths(strsplit(get(col), ",")))][, (col) := type.convert(unlist(strsplit(acc[[col]], ",", fixed = T)), as.is = T, na.strings = "")],
.init = DT_COMA
))[]
data.table [17 x 4]
5.4.3 Combining both steps:
Tidyverse:
DT_COMA <- DT_COMA |> mutate(across(where(\(v) is.character(v) & all(is.na(as.numeric(v)))), \(v) str_remove_all(v, ",")))
reduce(
select(DT_COMA, where(\(v) is.character(v) & any(!is.na(as.numeric(v))))) |> colnames(),
\(acc, col) acc |> tidyr::separate_rows(col, sep = ",", convert = T),
.init = DT_COMA
)
data.frame [17 x 4]
data.table:
cols_to_clean <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & all(is.na(as.numeric(v)))] |> colnames()
cols_to_separate <- DT_COMA[, .SD, .SDcols = \(v) is.character(v) & any(!is.na(as.numeric(v)))] |> colnames()
DT_COMA[, c(cols_to_clean) := map(.SD[, cols_to_clean, with = F], \(v) str_remove_all(v, ","))]
data.table [10 x 4]
(reduce(
cols_to_separate,
\(acc, col) acc[rep(1:.N, lengths(strsplit(get(col), ",")))][, (col) := type.convert(unlist(strsplit(acc[[col]], ",", fixed = T)), as.is = T, na.strings = "")],
.init = DT_COMA
))[]
data.table [17 x 4]
5.5 Multiple choice questions:
Data:
surv
data.frame [5 x 2]
Here we will spread the answers into their own columns using a pivot because not all rows have all the possible answers:
Tidyverse:
surv |>
mutate(response = str_split(response, fixed("|"))) |>
unnest(response) |>
pivot_wider(id_cols = ID, names_from = response, values_from = response, values_fn = \(.x) sum(!is.na(.x)), values_fill = 0)
data.frame [5 x 6]
data.table:
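The data.table tab is empty in the rendering; one possible translation (a sketch, assuming surv has an ID column and a |-separated response column, as used above):

```r
SURV <- as.data.table(surv)
# Split the answers into one row each, then count occurrences per ID:
# length as the aggregate gives 0/1 indicator columns
dcast(
  SURV[, .(response = unlist(strsplit(response, "|", fixed = TRUE))), by = ID],
  ID ~ response, fun.aggregate = length
)
```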
5.6 Filling with lagging conditions:
Task: See this SO question.
Data:
ZIP <- structure(
list(
zipcode = c(1001, 1002, 1003, 1004, 1101, 1102, 1103, 1104, 1201, 1202, 1203, 1302),
areacode = c(4, 4, NA, 4, 4, 4, NA, 1, 4, 4, NA, 4),
type = structure(c(1L, 1L, NA, 1L, 2L, 2L, NA, 1L, 1L, 1L, NA, 1L), .Label = c("clay", "sand"), class = "factor"),
region = c(3, 3, NA, 3, 3, 3, NA, 3, 3, 3, NA, 3),
do_not_fill = c(1, NA, NA, 1, 1, NA, NA, 1, NA, NA, NA, 1)
),
class = c("data.table", "data.frame"), row.names = c(NA, -12L)
)
Tidyverse:
as_tibble(ZIP) |>
mutate(type = as.character(type)) |>
mutate(
across(1:4, ~ ifelse(
is.na(.) & lag(areacode) == lead(areacode) &
lag(as.numeric(substr(zipcode, 1, 2))) == lead(as.numeric(substr(zipcode, 1, 2))),
lag(.), .
)
)
)
data.frame [12 x 5]
data.table:
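The data.table tab is empty here; a sketch of an equivalent using shift() (my reconstruction; note that FALSE & NA is FALSE, so boundary rows, where shift() yields NA, keep their values):

```r
cols <- names(ZIP)[1:4]
copy(ZIP)[, type := as.character(type)  # avoid factor/character mismatch, as in the tidyverse version
  ][, (cols) := lapply(.SD, \(c) fifelse(
        is.na(c)
        & shift(areacode) == shift(areacode, type = "lead")
        & shift(substr(zipcode, 1, 2)) == shift(substr(zipcode, 1, 2), type = "lead"),
        shift(c), c)),
    .SDcols = cols][]
```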
5.7 Join + Coalesce:
Task: Replace the missing dates from one dataset with the earliest date from another dataset, matching by ID:
Data:
(dt1 <- data.table::fread(
"
id x y z
1 A 1 NA
2 C 3 NA
3 C 3 NA
4 C 2 NA
5 B 2 2019-08-04
6 C 1 2019-09-18
7 B 3 2019-12-17
8 A 2 2019-11-02
9 A 3 2020-03-16
10 A 1 2020-01-31
"
))
data.table [10 x 4]
(dt2 <- data.table::fread(
" id date
1 2012-09-25
1 2012-03-26
1 2012-11-12
2 2013-01-24
2 2012-05-04
2 2012-02-24
3 2012-05-30
3 2012-02-15
4 2012-03-13
4 2012-05-18
"))
data.table [10 x 2]
Tidyverse:
Using coalesce:
left_join(
dt1,
dt2 |> group_by(id) |> summarize(date = min(date)),
by = "id"
) |> mutate(date = coalesce(z, date), z = NULL)
data.table [10 x 4]
Using the rows_* functions:
rows_patch(
dt1 |> rename(date = z),
dt2 |> group_by(id) |> summarize(date = min(date)),
by = "id"
)
data.table [10 x 4]
data.table:
As a right join:
copy(dt2)[, .(date = min(date)), by = id
][dt1, on = "id"][, let(date = fcoalesce(date, z), z = NULL)][]
data.table [10 x 4]
As a left join:
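The left-join code is missing from the rendering; a sketch using an update join (my reconstruction):

```r
# Keep all rows of dt1, patching `z` with the earliest date per id,
# then rename `z` to `date` to match the right-join result
res <- copy(dt1)[dt2[, .(date = min(date)), by = id], on = "id", z := fcoalesce(z, i.date)]
setnames(res, "z", "date")[]
```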
5.8 Join on multiple columns (partial matching):
Task: Join both tables based on matching IDs, but the IDs are split between multiple columns in one table (id1 & id2).
(dt1 <- data.table(id = c("ABC", "AAA", "CBC"), x = 1:3))
data.table [3 x 2]
(dt2 <- data.table(
id1 = c("ABC", "AA", "CB"),
id2 = c("AB", "AAA", "CBC"),
y = c(0.307, 0.144, 0.786))
)
data.table [3 x 3]
Solution 1:
Combine the two ID columns into one with pivot_longer, then join:
dt2 |> pivot_longer(matches("^id"), names_to = NULL, values_to = "id") |> right_join(dt1)
data.frame [3 x 3]
melt(dt2, measure.vars = patterns("^id"), value.name = "id")[, variable := NULL][dt1, on = "id"]
data.table [3 x 3]
Solution 2:
Combine the two ID columns into one with unite + separate_rows, then join:
(From @TimTeaFan)
dt2 |> unite("id", id1, id2, sep = "_") |> separate_rows("id") |> right_join(dt1)
data.frame [3 x 3]
copy(dt2)[, id := paste(id1, id2, sep = "_")
][, c(V1 = strsplit(id, "_", fixed = TRUE), .SD), by = id
][, let(id = V1, V1 = NULL, id1 = NULL, id2 = NULL)
][dt1, on = "id"]
data.table [3 x 3]
Solution 3:
Join on one of the two columns (id2 here), and then fill in (patch) the missing values:
left_join(dt2, dt1, by = c("id2" = "id")) |>
rows_patch(rename(dt1, id1 = id), unmatched = "ignore")
data.table [3 x 4]
5.9 Merging rows across multiple columns (every X rows):
Data:
(BANK <- data.table(
date = c("30 feb", "NA", "NA", "NA", "31 feb", "NA", "NA", "NA"),
description = c("Mary", "had a", "little", "lamb", "Twinkle", "twinkle", "little", "star"),
withdrawal = c("100", "NA", "NA", "NA", "NA", "NA", "NA", "NA"),
deposit = c("NA", "NA", "NA", "NA", "100", "NA", "NA", "NA")
)[, lapply(.SD, \(c) type.convert(c, as.is = T))]
)
data.table [8 x 4]
merge_and_convert <- function(v) {
  type.convert(v, as.is = T) |> na.omit() |>
    paste(collapse = " ") |> type.convert(as.is = T) |>
    bind(x, ifelse(is.logical(x), as.integer(x), x))
}
Tidyverse:
Solution 1:
mutate(BANK, ID = ceiling(row_number() / 4)) |>
group_by(ID) |>
summarize(across(everything(), \(m) merge_and_convert(m)))
data.frame [2 x 5]
Solution 2:
summarize(BANK, across(
everything(),
\(c) sapply(split(c, ceiling(seq_along(c)/4)), \(m) merge_and_convert(m))
))
data.frame [2 x 4]
data.table:
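The data.table code is missing from the rendering; a sketch of an equivalent (my reconstruction):

```r
# Tag every block of 4 rows, collapse each block column by column,
# then drop the helper ID to match the [2 x 4] output
copy(BANK)[, ID := ceiling(.I / 4)
  ][, lapply(.SD, merge_and_convert), by = ID
  ][, ID := NULL][]
```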
data.table [2 x 4]
5.10 Tagging successive events:
Tagging repeated blocks of events (aka run length encoding):
data.frame [15 x 1]
DAT <- as.data.table(dat)
Manually:
data.frame [15 x 2]
data.frame [15 x 2]
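The manual solutions are omitted in the rendering; two possible sketches (my reconstruction):

```r
# Compare each event to the previous one and cumulate the change points
dat |> mutate(ID = cumsum(event != lag(event, default = first(event))) + 1)

# Or with base R's run-length encoding
transform(dat, ID = with(rle(event), rep(seq_along(lengths), lengths)))
```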
Tidyverse:
dat |> mutate(ID = consecutive_id(event))
data.frame [15 x 2]
data.table:
DAT[, ID := rleid(event)][]
data.table [15 x 2]
6 Miscellaneous:
6.1 Keywords:
.SD
.I, .N
.GRP, .NGRP
.BY
.EACHI
6.2 Useful functions:
fsetdiff, fintersect, funion and fsetequal (apply to data.tables instead of vectors)
nafill, fcoalesce
as.IDate

Citation
@online{rivière2022,
author = {Marc-Aurèle Rivière},
title = {Tidyverse \textless-\textgreater{} Data.table},
date = {2022-05-19},
url = {https://ma-riviere.me/content/posts/data.table},
langid = {en},
abstract = {This document is a collection of notes I took while
learning to use `data.table`, summarizing the equivalences between
most `dplyr`/`tidyr` verbs and `data.table`.}
}